Why an instruction dataset?
Creating an instruction dataset tailored for fine-tuning a language model is a critical step in enhancing the model’s capabilities for specialized tasks. This guide walks through a concrete example of creating an instruction dataset.
Before creating your dataset, it’s essential to define the intended purpose. Are you building a chatbot, a story generator, or a question-answering system? Understanding the desired behavior of the model will guide the type and structure of data you are preparing.
Our objective is to create an instruction dataset that is ready for fine-tuning a pretrained large or small language model into a story generator for 5-year-olds.
Choose raw dataset
To do so, for the sake of demonstration, we choose:
- The raw dataset TinyStories, presented along with the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English? by Ronen Eldan and Yuanzhi Li. This dataset contains short stories synthetically generated by GPT-3.5 and GPT-4 using only a small vocabulary, which makes it well suited to our intended 5-year-old readers. The dataset has 2 splits: train (2.12M rows) and validation (22K rows). In our use case, I use only the first 10K rows of the train split.
Let’s see what the TinyStories dataset looks like.
To obtain the instruction dataset, we need to synthetically generate, for each story in the TinyStories dataset, an instruction sentence that corresponds to that story. Each record then pairs the generated instruction with the original story as the expected output.
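As a rough illustration of the target format, the sketch below pairs one raw record with an instruction. The `text` key and the `instruction`/`output` column names match the pipeline code later in this guide; the story, the instruction, and the helper name `to_instruction_record` are made up for illustration.

```python
# Hypothetical raw record; the pipeline reads each story from the "text" key.
raw_record = {"text": "Once upon a time, a little dog named Max found a red ball in the park."}


def to_instruction_record(raw: dict, instruction: str) -> dict:
    """Pair a generated instruction with the original story as the expected output."""
    return {"instruction": instruction, "output": raw["text"]}


# In the real pipeline the instruction comes from an LLM; here it is hand-written.
record = to_instruction_record(
    raw_record,
    "Write a story about a little dog named Max who finds a red ball in the park.",
)
print(record)
```

The story itself is never modified: it simply becomes the `output` that the fine-tuned model should learn to produce for that instruction.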
Implementation
Now that we have an idea of what the instruction dataset looks like, let’s walk through the notebook step by step to create it.
First, we load required packages.
import concurrent.futures
import json
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple
from datasets import Dataset, load_dataset, concatenate_datasets
from openai import OpenAI
from tqdm.auto import tqdm
from google.colab import userdata
Then, we write the necessary functions to make our pipeline modular.
- The get_story_list function creates a list of stories from the raw dataset.
def get_story_list(dataset):
    # Each TinyStories row stores its story under the 'text' key
    return [example['text'] for example in dataset]
- The InstructionAnswerSet class:
  - Defines a data structure to store and manage instruction-answer pairs, with a method to create instances from JSON
  - Supports iteration over the pairs
class InstructionAnswerSet:
    def __init__(self, pairs: List[Tuple[str, str]]):
        self.pairs = pairs

    @classmethod
    def from_json(cls, json_str: str, story: str) -> 'InstructionAnswerSet':
        # The model returns {"instruction_answer": "..."}; the story itself is the answer
        data = json.loads(json_str)
        pairs = [(data['instruction_answer'], story)]
        return cls(pairs)

    def __iter__(self):
        return iter(self.pairs)
- The generate_instruction_answer_pairs function takes a story and an OpenAI client as input and generates instruction-answer pairs using GPT-4o mini (the gpt-4o-mini model). It passes the story through a carefully crafted prompt to create a relevant instruction while following specific formatting requirements.
def generate_instruction_answer_pairs(story: str, client: OpenAI) -> List[Tuple[str, str]]:
    prompt = f"""Based on the following story, generate a one-sentence instruction. Instruction \
must ask to write about the content of the story.
Only use content from the story to generate the instruction. \
Instruction must never explicitly mention a story. \
Instruction must be self-contained and general. \
Example story: Once upon a time, there was a little girl named Lily. \
Lily liked to pretend she was a popular princess. She lived in a big castle \
with her best friends, a cat and a dog. One day, while playing in the castle, \
Lily found a big cobweb. The cobweb was in the way of her fun game. \
She wanted to get rid of it, but she was scared of the spider that lived there. \
Lily asked her friends, the cat and the dog, to help her. They all worked together to clean the cobweb. \
The spider was sad, but it found a new home outside. Lily, the cat, and \
the dog were happy they could play without the cobweb in the way. \
And they all lived happily ever after.
Example instruction: Write a story about a little girl named Lily who, \
with the help of her cat and dog friends, overcomes her fear of a spider to \
clean a cobweb in their castle, allowing everyone to play happily ever after. \
Provide your response in JSON format with the following structure:
{{"instruction_answer": "..."}}
Story:
{story}
"""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant who "
                    "generates instruction based on the given story. "
                    "Provide your response in JSON format."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
        temperature=0.7,
    )
    result = InstructionAnswerSet.from_json(completion.choices[0].message.content, story)
    # Convert to a list of (instruction, story) tuples
    return result.pairs
- Next, we wrap all the functions above into a final function, create_instruction_dataset, that creates the instruction dataset.
def create_instruction_dataset(dataset: Dataset, client: OpenAI, num_workers: int = 4) -> Dataset:
    stories = get_story_list(dataset)
    instruction_answer_pairs = []
    # Send requests concurrently; results are collected in completion order
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(generate_instruction_answer_pairs, story, client) for story in stories]
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            instruction_answer_pairs.extend(future.result())
    # Split the (instruction, story) tuples into two parallel columns
    instructions, answers = zip(*instruction_answer_pairs)
    return Dataset.from_dict({"instruction": list(instructions), "output": list(answers)})
- Afterwards, the main function orchestrates the entire pipeline:
  - Initialize the OpenAI client
  - Load the raw TinyStories dataset
  - Create the instruction dataset
  - Perform the train/test split
  - Push the processed dataset to Hugging Face Hub
def main() -> None:
    # Initialize the OpenAI client
    client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
    # Load the raw data (first 10K rows of the train split)
    raw_dataset = load_dataset("roneneldan/TinyStories", split="train[:10000]")
    # Create the instruction dataset
    instruction_dataset = create_instruction_dataset(raw_dataset, client)
    # Train/test split (90% train, 10% test)
    filtered_dataset = instruction_dataset.train_test_split(test_size=0.1)
    # Push the processed dataset to Hugging Face Hub
    filtered_dataset.push_to_hub("tanquangduong/TinyStories_Instruction")
- Finally, we authenticate with Hugging Face Hub and execute the main pipeline to generate and upload our instruction dataset.
from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))
# Launch the pipeline to create instruction dataset
main()
Congratulations!
The resulting instruction dataset should look like this.
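Beyond eyeballing a few rows, a quick structural check can confirm the layout: every row should carry a non-empty `instruction` and a non-empty `output`. The sketch below runs on plain dictionaries, so it needs no network access; `find_invalid_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
from typing import Dict, List


def find_invalid_rows(rows: List[Dict[str, str]]) -> List[int]:
    """Return indices of rows missing a non-empty 'instruction' or 'output' string."""
    bad = []
    for i, row in enumerate(rows):
        for key in ("instruction", "output"):
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one problem is enough to flag the row
    return bad


# Made-up rows in the same two-column layout as the generated dataset.
sample_rows = [
    {"instruction": "Write a story about a girl who shares her toys.", "output": "Once upon a time..."},
    {"instruction": "", "output": "A story with a missing instruction."},
]
invalid = find_invalid_rows(sample_rows)
print(invalid)
```

With the real dataset, the same function could presumably be applied to the rows of each split after converting them to a list of dictionaries.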
To wrap up, this guide first presented the role of an instruction dataset in fine-tuning. To define the structure and format of the instruction dataset, we first need to understand the intended purpose of fine-tuning. In our use case, we used GPT-4o mini (the gpt-4o-mini model) to create an instruction for each story. We customized the prompt with some prompt-engineering best practices: precise instructions, a one-shot example, and a specified output format. The resulting instruction dataset is finally pushed to Hugging Face Hub for later use.
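One practical note on the pipeline: in create_instruction_dataset, future.result() re-raises any exception raised inside a worker, so a single malformed model response would make the function raise and no dataset would be returned. The sketch below shows one possible way to keep going and record failures instead. It uses a stub, `fake_generate`, in place of the OpenAI call; `fake_generate` and `collect_pairs` are hypothetical names, not part of the original notebook.

```python
import concurrent.futures
from typing import List, Tuple


def fake_generate(story: str) -> List[Tuple[str, str]]:
    """Stand-in for the LLM call: fails on empty stories, otherwise returns one pair."""
    if not story.strip():
        raise ValueError("empty story")
    return [(f"Write a story that begins: {story[:20]}", story)]


def collect_pairs(stories: List[str], num_workers: int = 4):
    """Run fake_generate in a thread pool, keeping successes and recording failures."""
    pairs, failures = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map each future back to its story so failures can be reported.
        future_to_story = {executor.submit(fake_generate, s): s for s in stories}
        for future in concurrent.futures.as_completed(future_to_story):
            try:
                pairs.extend(future.result())  # re-raises the worker's exception, if any
            except Exception as exc:
                failures.append((future_to_story[future], repr(exc)))
    return pairs, failures


pairs, failures = collect_pairs(["A cat sat on a mat.", "", "A dog ran fast."])
print(len(pairs), len(failures))
```

Because as_completed yields futures in completion order rather than submission order, the order of `pairs` can vary between runs. Each tuple carries its own story, so the instruction-to-story pairing stays intact either way.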